Text-to-speech inspired duration modeling for improved whole-word acoustic models

Authors

  • Keith Kintzley
  • Aren Jansen
  • Hynek Hermansky
Abstract

In the construction of whole-word acoustic models, we have previously demonstrated substantial gains by using maximum a posteriori (MAP) estimation to introduce a simple prior model of phonetic timing. Because it was based solely on the word's phonetic (dictionary) pronunciation, this simple model included no information about the individual durations of the constituent phones. However, the problem of modeling segmental duration has long been studied in the text-to-speech (TTS) community. We draw upon this work to develop a classification and regression tree (CART) approach for constructing prior models of phonetic timing that considers factors such as syllable stress, syllable position, adjacent phone class, and voicing. This improved prior model closes 33% of the gap in keyword spotting performance between highly supervised whole-word models and those estimated without any examples.
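The CART idea in the abstract can be illustrated with a minimal sketch, assuming Python 3.9+ and only the standard library. It grows a regression tree that predicts a phone's duration from categorical context factors of the kind the abstract lists (here only stress, word position, and voicing). The feature names, the toy durations, and the helper `relative_boundaries` are hypothetical illustrations; they are not the authors' feature set, data, or their method for turning durations into a timing prior.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# A phone token's context as categorical factors (hypothetical feature names).
Features = dict[str, str]


@dataclass
class Node:
    """Internal nodes test `feature == value`; every node stores its mean duration."""
    prediction: float
    feature: Optional[str] = None
    value: Optional[str] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def sse(ys: list[float]) -> float:
    """Sum of squared errors about the mean: the CART regression impurity."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow(xs: list[Features], ys: list[float], depth: int = 0,
         max_depth: int = 3, min_leaf: int = 2) -> Node:
    """Greedily grow a regression tree with binary `feature == value` splits."""
    node = Node(prediction=mean(ys))
    if depth >= max_depth or len(ys) < 2 * min_leaf:
        return node  # too deep, or too few tokens to split further
    best = None  # (cost, feature, value) of the best split found so far
    for feat in xs[0]:
        for val in sorted({x[feat] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[feat] == val]
            right = [y for x, y in zip(xs, ys) if x[feat] != val]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, feat, val)
    if best is None or best[0] >= sse(ys):
        return node  # no admissible split reduces the squared error
    _, feat, val = best
    node.feature, node.value = feat, val
    mask = [x[feat] == val for x in xs]
    node.yes = grow([x for x, k in zip(xs, mask) if k], [y for y, k in zip(ys, mask) if k],
                    depth + 1, max_depth, min_leaf)
    node.no = grow([x for x, k in zip(xs, mask) if not k], [y for y, k in zip(ys, mask) if not k],
                   depth + 1, max_depth, min_leaf)
    return node


def predict(node: Node, x: Features) -> float:
    """Follow the splits down to a leaf and return its mean duration."""
    while node.feature is not None:
        node = node.yes if x[node.feature] == node.value else node.no
    return node.prediction


def relative_boundaries(durations: list[float]) -> list[float]:
    """Hypothetical helper: phone end-times as fractions of the total word duration."""
    total, acc, out = sum(durations), 0.0, []
    for d in durations:
        acc += d
        out.append(acc / total)
    return out


# Hypothetical toy data (durations in ms); voicing is deliberately uninformative.
rows = [
    ("1", "final", "voiced", 140.0), ("1", "final", "voiceless", 150.0),
    ("1", "nonfinal", "voiced", 100.0), ("1", "nonfinal", "voiceless", 110.0),
    ("0", "final", "voiced", 90.0), ("0", "final", "voiceless", 100.0),
    ("0", "nonfinal", "voiced", 50.0), ("0", "nonfinal", "voiceless", 60.0),
]
xs = [{"stress": s, "position": p, "voicing": v} for s, p, v, _ in rows]
ys = [d for *_, d in rows]
tree = grow(xs, ys)
```

On this toy data the tree splits first on stress and then on word position, and it never uses voicing because voicing carries no signal here; for example, `predict(tree, {"stress": "1", "position": "final", "voicing": "voiced"})` returns the leaf mean 145.0. A practical duration model would use far more factors and tokens, and CART is usually combined with a pruning step, which this sketch omits.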

Similar resources

Modeling word durations

We describe a new method of modeling duration at word level. These duration models are easily trained from the acoustic training data and can be used to rescore N-best lists of recognition hypotheses. The models capture some of the well known durational effects such as prepausal lengthening. They incorporate a simple back off mechanism to handle unseen words during rescoring. Experiments with v...

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

Modeling Word Duration for Better Speech Recognition

We describe a new method of modeling duration at word level. These duration models are easily trained from the acoustic training data and can be used to rescore N-best lists of recognition hypotheses. The models capture some of the well known durational effects such as prepausal lengthening. They incorporate a simple back off mechanism to handle unseen words during rescoring. Experiments with v...

Statistical prosodic modeling: from corpus design to parameter estimation

The increasing availability of carefully designed and collected speech corpora opens up new possibilities for the statistical estimation of formal multivariate prosodic models. At Apple Computer, statistical prosodic modeling exploits the Victoria corpus, recently created to broadly support ongoing speech synthesis research and development. This corpus is composed of five constituent parts, eac...

Convolutional Neural Network with Adaptable Windows for Speech Recognition

Although speech recognition systems are widely used and their accuracy is continually improving, there is still a considerable gap between their accuracy and human recognition ability. This is partly due to high speaker variability in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...

Publication date: 2013